Abstract

The  paper  introduces  the  concept  of  a  negative  database,  in  which  a  set  of  records  DB  is  represented  by 
its  complement  set.  That  is,  all  the  records  not  in  DB  are  represented,  and  DB  itself  is  not  explicitly 
stored.  After  introducing  the  concept,  several  results  are  given  regarding  the  feasibility  of  such  a  scheme 
and its potential for enhancing privacy. It is shown that a database consisting of n l-bit records can
be represented negatively using only O(ln) records. It is also shown that membership queries for DB
can be processed against the negative representation in time no worse than linear in its size and that
reconstructing the database DB represented by a negative database NDB given as input is an NP-hard
problem  when  time  complexity  is  measured  as  a  function  of  the  size  of  NDB. 


1  Introduction 

Large  collections  of  data  are  ubiquitous,  and  the  demands  that  will  be  placed  on  these  collections  in  the 
near  future  are  increasing.  We  expect  them  to  be  available  when  we  need  them;  we  expect  them  not  to 
be  available  to  malicious  parties;  the  contents  of  the  collections  and  the  rules  for  accessing  them  must  be 
continually  updated;  we  would  like  to  be  able  to  search  them  in  new  ways,  drawing  inferences  about  large- 
scale  patterns  and  trends;  we  want  to  be  protected  from  the  wrong  kinds  of  inferences  being  made  (as  in 
racial  profiling);  and,  eventually,  we  will  want  the  ability  to  audit  the  uses  to  which  our  personal  data  are 
put.  Although  many  of  these  problems  are  old,  they  must  now  be  solved  more  quickly  for  larger  and  more 
dynamic  collections  of  data. 

In  this  paper  we  introduce  an  approach  to  representing  data  that  addresses  some  of  these  issues.  In 
our  approach,  the  negative  image  of  a  set  of  data  records  is  represented  rather  than  the  records  themselves 
(Figure  2).  Initially,  we  assume  a  universe  U  of  finite-length  records  (or  strings),  all  of  the  same  length  l, 
and  defined  over  a  binary  alphabet.  We  logically  divide  the  space  of  possible  strings  into  two  disjoint  sets: 
DB representing the set of positive records (holding the information of interest), and U - DB denoting the set
of all strings not in DB. We assume that DB is uncompressed (each record is represented explicitly), but
we allow U - DB to be stored in a compressed form called NDB. We refer to DB as the positive database
and  NDB  as  the  negative  database. 

From  a  logical  point  of  view,  either  representation  will  suffice  to  answer  questions  regarding  DB.  However, 
the  different  representations  may  present  different  advantages.  For  instance,  in  a  positive  database,  inspection 
of  a  single  record  provides  meaningful  information.  However,  inspection  of  a  single  (negative)  record  reveals 
little  meaningful  information  about  the  contents  of  the  original  database.  Because  the  positive  tuples  are 
never stored explicitly, a negative database would be much more difficult to misuse. Similarly, depending on
the representation for NDB, the efficiency of certain kinds of queries may differ significantly from the
efficiency of the same queries under DB.

*  University  of  New  Mexico  Technical  Report. 


Report Documentation Page (Standard Form 298, Rev. 8-98)
Title: Enhancing Privacy through Negative Representations of Data
Report date: 2004
Performing organization: University of New Mexico, Computer Science Department, Albuquerque, NM 87131
Distribution: Approved for public release; distribution unlimited
Security classification: unclassified
Number of pages: 12

Some applications may benefit from this change of perspective. Most applications seek to retrieve information
about DB as efficiently and accurately as possible, and they typically are not explicitly concerned
with U - DB. Yet, in situations where privacy is a concern, it may be useful to adopt a scheme in which
certain  queries  are  efficient  and  others  are  provably  inefficient. 

Current  technologies  of  encryption  (for  the  data  itself)  and  query  restriction  (for  controlling  access  to 
the  data)  help  ensure  confidentiality,  but  neither  solution  is  appropriate  for  all  applications.  In  the  case  of 
encryption,  the  ability  to  search  data  records  is  hindered,  while  in  the  case  of  query  restriction,  individual 
records  are  vulnerable  to  insider  attacks.  The  method  presented  here  potentially  addresses  both  of  these 
concerns. 

In the following sections, we first show that implementing NDB is computationally feasible. We do this
by introducing a representational scheme that requires O(ln) negative records to represent a positive database
consisting of n l-bit records, and then giving an algorithm for finding such a representation efficiently from
any finite DB. This representation is known as the prefix representation. The prefix representation supports
simple membership queries^1, insertions, and deletions. We then investigate some of the implications of
the negative scheme for privacy. In particular, we show that the general problem of recovering a positive
database from our negative representation is NP-hard, and we present a randomized algorithm for creating
negative representations that are difficult to reverse. Finally, we review related work, discuss the potential
consequences of our results, and outline areas of future investigation.


2  Representation 

In  order  to  create  a  database  NDB  that  is  reasonable  in  size,  we  must  compress  the  information  contained 
in U - DB but retain the ability to answer queries. We introduce one additional symbol to our binary
alphabet, known as a “don’t care,” written as *. The entries in NDB will thus be l-length strings over the
alphabet {0, 1, *}. The don’t-care symbol has the usual interpretation and will match either a one or a zero
at the bit position where the * appears. Positions in a string that are set either to one or zero are referred
to as “defined positions.” With this new symbol we can potentially represent large subsets of U - DB with
just a few entries.

For example, the set of strings U - DB can be exactly represented by the NDB set shown
below:


DB      (U - DB)          NDB

        001
000     010               0*1
111     011       =>      *10
        100               10*
        101
        110

The convention is that a string s is taken to be in DB if and only if s fails to match all the entries in NDB.
This condition is fulfilled only if for every string t_j ∈ NDB, s disagrees with t_j in at least one defined
position.
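The matching convention can be sketched in a few lines of Python. This is only an illustration under our own naming (the functions `matches` and `in_db` are not part of the original text); it scans every record of NDB, so a query takes time linear in the size of NDB.

```python
def matches(record, s):
    """True if a negative record (over 0, 1, *) matches the binary string s."""
    return all(r == "*" or r == b for r, b in zip(record, s))


def in_db(s, ndb):
    """s is taken to be in DB iff it matches no record of NDB."""
    return not any(matches(record, s) for record in ndb)


# NDB from the three-bit example above; DB = {000, 111}.
ndb = ["0*1", "*10", "10*"]
print([s for s in ("000", "001", "111") if in_db(s, ndb)])  # ['000', '111']
```

Each string of U - DB in the example is matched by exactly one of the three records, while 000 and 111 disagree with every record in at least one defined position.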

2.1  The  Prefix  Algorithm 

In  this  section  we  present  an  algorithm  as  proof  that  a  negative  database  NDB  can  be  constructed  in 
reasonable  time  and  of  reasonable  size.  The  prefix  algorithm  introduced  here  is  deterministic  and  reversible, 
which  has  consequences  for  the  kinds  of  inferences  that  can  be  made  efficiently  from  NDB.  We  would  like 
some inferences to be hard (e.g., inferring the original DB from NDB) and other inferences to be easy,
depending on the application (e.g., finding certain kinds of correlations in DB). However, in this paper,
we will focus only on the question of how easy it is to recover the original DB from NDB, a question we
address in Section 3.

1 Although indexing schemes could be developed to support truly efficient membership queries, our current emphasis is on
demonstrating the dichotomy between tractable and intractable queries.


Prefix algorithm

Let w_i denote an i-bit prefix and W_i a set
of i-length bit patterns.

1. i ← 0
2. Set W_i to the empty set
3. Set W_{i+1} to every pattern not present in
   DB's w_{i+1} but with prefix in W_i
4. for each pattern V_p in W_{i+1} {
5.     Create a record using V_p as its prefix
       and the remaining positions set to the
       don’t care symbol.
6.     Add record to NDB. }
7. Increment i by one
8. Set W_i to every pattern in DB's w_i
9. Return to step 3 as long as i < l.


Figure 1: The Prefix algorithm outputs a negative database NDB of size O(l|DB|) representing the strings
in U - DB.
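The steps of Figure 1 can be sketched in Python as follows. The function name `prefix_ndb` and the representation of records as strings are our own assumptions. We also read step 2 as starting from the empty prefix, so that the first iteration considers every one-bit pattern absent from DB; this is an interpretation, not a statement of the authors' implementation.

```python
def prefix_ndb(db, length):
    """Sketch of the prefix algorithm: build NDB from a set of l-bit strings."""
    ndb = []
    w = {""}  # prefixes of length i present in DB (only the empty prefix at i = 0)
    for i in range(length):
        seen = {s[: i + 1] for s in db}  # (i+1)-bit prefixes occurring in DB
        # Patterns absent from DB whose i-bit prefix does occur in DB.
        absent = sorted({p + b for p in w for b in "01"} - seen)
        # Pad each absent prefix with don't-care symbols.
        ndb.extend(p + "*" * (length - i - 1) for p in absent)
        w = seen
    return ndb


db = {"0001", "0100", "1000", "1011"}
print(prefix_ndb(db, 4))
```

For this four-record DB the sketch returns seven records, within the l|DB| = 16 bound; at most |DB| prefixes are extended by one bit in each of the l iterations.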


DB      U - DB    NDB     c-keys    RNDB

0001    0000      11**    11**      11**
0100    0010      001*    0*1*      0*1*
1000    0011      011*    *11*      1110
1011    0101      0000    00*0      *111
        0110      0101    *1*1      00*0
        0111      1001    1*01      *1*1
        1001      1010    **10      0101
        1010                        1*01
        1100                        **10
        1101                        *010
        1110
        1111

Figure 2: Column 1 gives an example DB, column 2 gives the corresponding U - DB, column 3 gives the
corresponding NDB generated by the prefix algorithm, column 4 presents some possible c-keys extracted
from NDB (see Section 4), and column 5 gives an example output of RNDB.


Lemma  2.1.1.  The  prefix  algorithm  creates  a  database  NDB  that  matches  exactly  those  strings  not  in 
DB. 

Proof. Step three of the algorithm (Fig. 1) finds every prefix not present in DB that has not already been
inserted in NDB. It then appends every possible string with that prefix to NDB (step 5). If a pattern absent
from DB is not present in window W_{i+1} and its prefix is not in W_i, then it must have been inserted in NDB before.
Step two initializes W_0 so that the first iteration considers every pattern absent from DB. □

Theorem 2.1.1. The negative data set (U - DB) can be represented using O(l|DB|) records.

Proof. For every window of size i there are at most |DB| “negative” records created and inserted in NDB
(steps 4-6). The number of windows is at most l (step 9); therefore, the number of negative records is
O(l|DB|). □


The  NDB  produced  by  the  prefix  algorithm  has  some  interesting  properties.  For  example,  each  record 
of NDB uniquely covers a subset of U - DB. This nonoverlapping property allows NDB to support more
powerful queries than simple membership. Questions like “Are there any engineers in DB?” can be answered
by finding all records that match ‘engineer’ in the corresponding field of NDB and simply counting whether
these records completely represent the subset of U that contains the engineers. An example DB, U - DB
and  the  NDB  produced  by  the  prefix  algorithm  is  given  in  Fig.  2. 
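The counting argument can be illustrated with a short Python sketch. The query format (a string over {0, 1, *} whose defined positions encode the field value of interest) and the name `db_has_match` are assumptions made for this example. The count is valid only because the records of a prefix-generated NDB do not overlap.

```python
def db_has_match(query, ndb):
    """Return True if some DB string matches `query` (a string over 0, 1, *).

    Assumes the records of `ndb` are pairwise non-overlapping, as produced
    by the prefix algorithm, so covered strings can be counted record by record.
    """
    covered = 0
    for record in ndb:
        pairs = list(zip(query, record))
        # The record and the query intersect only if no defined position conflicts.
        if all(q == "*" or r == "*" or q == r for q, r in pairs):
            # Positions left open by both contribute a factor of two each.
            covered += 2 ** sum(q == "*" and r == "*" for q, r in pairs)
    # DB contains a match iff NDB does not cover the whole query subspace.
    return covered < 2 ** query.count("*")


# NDB produced by the prefix algorithm for DB = {0001, 0100, 1000, 1011} (Fig. 2).
ndb = ["11**", "001*", "011*", "0000", "0101", "1001", "1010"]
print(db_has_match("11**", ndb))  # False: no DB record starts with 11
print(db_has_match("0***", ndb))  # True: 0001 and 0100 start with 0
```

With an overlapping NDB the sum could count some strings twice, so the comparison against the size of the query subspace would no longer be reliable.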


3  Reversibility 

In  section  2.1  we  presented  an  algorithm  for  generating  NDB  that  easily  demonstrates  the  feasibility  of  a 
negative  representation.  In  what  follows  we  turn  our  attention  to  the  goal  of  making  DB  hard  to  reconstruct. 
First we establish that the representation is potentially difficult to reverse, and then we present an algorithm
that indeed produces hard-to-reverse instances.

Reconstruction of DB from NDB is NP-hard in the following sense^2.

Definition  3.0.1.  Self  Recognition  (SR): 

INPUT: U - DB represented by a collection NDB of length l bit strings, such that each string may contain
any number of * symbols, and a candidate self set DB.

QUESTION: Does NDB represent the self set DB?

We establish SR is NP-hard. Note that NDB represents an arbitrary set U - DB, and we do not specify
how it was obtained. First we establish the NP-completeness of the following problem.

Definition 3.0.2. Non-empty Self Recognition (NESR):

INPUT: A set U - DB of binary strings represented by a collection NDB of length l strings over the alphabet
{0, 1, *}.

QUESTION: Is DB nonempty? That is, is there some string in U = {0, 1}^l not matched by NDB?

Theorem 3.0.2. NESR is NP-complete.

Proof. NESR is clearly in NP. (If we guess a string, it is easy to verify that it is not matched by comparing
it against every record in NDB.)

The NP-completeness of NESR is established by transformation from 3-SAT. Start with instance I of 3-SAT.
Let X be the set of variables {x_i}, and suppose l is the number of variables. The constructed instance
of NESR will be over length l strings. Each clause {L_i, L_j, L_k} in I (L_i is a literal, which is either x_i or
the complement of x_i) creates a length l string in NDB as follows. All positions other than i, j, or k contain *.
Position i contains 0 if L_i is x_i and contains 1 if L_i is the complement of x_i. A similar construction is used
for the other two literals L_j and L_k in this clause.

Claim: There exists a truth assignment satisfying I if and only if there exists a string in U = {0, 1}^l not
matched by NDB. In the following, if A is a truth assignment to the variables in X, S(A) is the string in
U obtained by setting the ith bit to 1 if A assigns x_i = T and the ith bit to 0 if A assigns x_i = F.

We have:

A satisfies I

<=> for every clause C_q = {L_i, L_j, L_k}, at least one
literal is satisfied

<=> S(A) fails to match at least one of the bits
i, j, k of the qth member of NDB (generated from C_q),
because an uncomplemented literal L_i generates 0 in the
ith position and a complemented L_i generates 1 in the
ith position (and similarly for L_j, L_k)

<=> S(A) is in DB.

□

2 For historical reasons we sometimes refer to DB as Self and U - DB as Nonself.

Corollary 3.0.1. NESR is NP-complete even if every record of NDB contains exactly three defined
positions.

Proof.  Our  transformation  always  produces  such  an  instance  of  NESR.  □ 
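The transformation used in the proof of Theorem 3.0.2 can be sketched in Python. Encoding literals as signed, 1-based integers (3 for x_3, -3 for its complement) is an assumption borrowed from common SAT tooling, and the formula below is a hypothetical instance chosen only for illustration.

```python
from itertools import product


def clause_to_record(clause, num_vars):
    """Map one clause (signed 1-based variable indices) to an NDB record."""
    record = ["*"] * num_vars
    for literal in clause:
        # Uncomplemented x_i yields 0 at position i, complemented x_i yields 1.
        record[abs(literal) - 1] = "0" if literal > 0 else "1"
    return "".join(record)


def matches(record, s):
    """True if the record (over 0, 1, *) matches the binary string s."""
    return all(r == "*" or r == b for r, b in zip(record, s))


def satisfies(s, clauses):
    """True if the assignment S(A) = s satisfies every clause (bit 1 means T)."""
    return all(
        any((s[abs(lit) - 1] == "1") == (lit > 0) for lit in clause)
        for clause in clauses
    )


clauses = [(1, -2, 3), (-1, 2, 4), (2, -3, -4)]  # hypothetical 3-SAT instance
ndb = [clause_to_record(c, 4) for c in clauses]
print(ndb)  # ['010*', '10*0', '*011']
```

Each record has exactly three defined positions, as in Corollary 3.0.1, and a string is left unmatched by this NDB exactly when the corresponding assignment satisfies the formula.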

Corollary 3.0.2. Empty Self Recognition (ESR, the complement of NESR, answers YES if and only if
NDB represents the empty set) is NP-hard.

Proof. Trivial Turing transformation from NESR. □

Theorem 3.0.3. Self Recognition (SR, defined above) is NP-hard.

Proof.  We  have  established  this  to  be  the  case  even  when  the  candidate  self  set  DB  is  empty,  and  even  when 
every  member  of  NDB  contains  exactly  three  defined  positions.  □ 

4 The Randomize_NDB Algorithm

The prefix algorithm presented in Section 2.1 is simple and demonstrates that a compact negative representation
NDB can be obtained from DB. Although we have demonstrated in Section 3 that the general
problem of reversing a given set NDB to obtain DB is NP-hard, using the simple prefix algorithm to obtain
NDB from DB raises two concerns regarding privacy: (a) the prefix algorithm produces only an easy subset
of possible NDB instances, and (b) if the action of the prefix algorithm (or any algorithm) that produces
NDB from DB could be reproduced by an adversary, then the adversary could easily decide for a given
NDB and candidate DB whether NDB represents U - DB. (The two concerns are, of course, related, for
if an algorithm were capable of producing only one NDB for each DB it is given as input, the image of the
algorithm could not define an NP-hard set of instances of NESR.) In this section, we present a randomized
algorithm (Fig. 3), the Randomize_NDB algorithm (RNDB for short), which addresses both of these concerns.
The prefix algorithm is modified by introducing a sequence of random choices that enlarges the set of
instances of NDB it can produce, so that the reversibility of the problem instances in the algorithm’s image
defines an NP-hard problem. Further, since the execution of the algorithm is randomized, re-application of
the algorithm by an adversary requires reproducing the algorithm’s random sequence of choices (see example
in Fig. 2).

Section 3 presents a transformation from 3-SAT to NDB, and in what follows we will use the formalisms
interchangeably. In particular, DB and sets of assignments will be used interchangeably, NDB and formula
φ will be used interchangeably, and the output of the algorithms to be presented in this section can be viewed
either as strings in NDB or clauses in φ.

Definition 4.0.3. A c-key is a bit pattern not present in DB with no extraneous bits: a c-key defines
a minimal pattern in that the removal of any bit yields a pattern in DB (see Figure 2). A c̄-key is the
complement of a c-key.

Definition  4.0.4.  A  c-clause  is  a  pattern  composed  of  a  c-key  plus  at  most  two  additional  specified  bit 
positions. 
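Definition 4.0.3 can be checked mechanically for small examples. The sketch below assumes that a pattern is "present in DB" when at least one DB string agrees with it on all of its defined positions; the function names are ours.

```python
def present(pattern, db):
    """True if some DB string agrees with every defined position of `pattern`."""
    return any(all(c == "*" or c == b for c, b in zip(pattern, s)) for s in db)


def is_c_key(pattern, db):
    """True if `pattern` is absent from DB and no defined bit is extraneous."""
    if present(pattern, db):
        return False
    for i, c in enumerate(pattern):
        if c != "*":
            # Removing any single defined bit must yield a pattern present in DB.
            reduced = pattern[:i] + "*" + pattern[i + 1:]
            if not present(reduced, db):
                return False
    return True


db = {"0001", "0100", "1000", "1011"}
print(is_c_key("11**", db))  # True: minimal pattern absent from DB
print(is_c_key("1110", db))  # False: still absent after dropping the last two bits
```

Under Definition 4.0.4, the pattern 1110 is a c-clause with respect to this DB: it consists of the c-key 11** plus two additional specified bit positions.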

Theorem 4.0.4. Let DB be a set of assignments and φ a CNF formula. φ is satisfied by every x ∈ DB iff
every clause C_q in φ contains a c̄-key with respect to DB.

Proof. Suppose clause C_q of φ contains a c̄-key. Then, by Definition 4.0.3, no x ∈ DB contains the complement
pattern of the c̄-key. Thus each x ∈ DB contains at least one bit appearing in the c̄-key, hence satisfying the
corresponding literal of this bit, thus satisfying C_q.

Now suppose each x ∈ DB satisfies each clause of φ (that is, each x is a satisfying truth assignment for
φ). Suppose, to the contrary, that some clause C_q does not contain a c̄-key. Then the complement pattern
of C_q appears in DB, and in particular in at least one x ∈ DB. But then x contains no bit appearing in
C_q, thus failing to satisfy each of the corresponding literals in C_q. Hence, we have a contradiction, and it
must be that every clause C_q contains a c̄-key.

□ 


Randomize_NDB algorithm

Let w_i denote an i-bit prefix and W_i a set
of i-length patterns.

1. i ← ⌈log2(l)⌉
2. Initialize W_i to the set of every pattern
   of i bits.
3. Set W_{i+1} to every pattern not present in
   DB's w_{i+1} but with prefix in W_i
4. for each pattern V_p in W_{i+1} {
5.     Randomly choose 1 ≤ j ≤ l
6.     for k = 1 to j do {
7.         V_pg ← Pattern_Generate(π(DB), V_p)
8.         Insert V_pg in NDB. }}
9. Increment i by one
10. Set W_i to every pattern in DB's w_i
11. Return to step 3 as long as i < l.


Figure 3: The Randomize_NDB algorithm randomly generates a negative database representing the strings
in U - DB.


Lemma 4.0.2. For every possible c-clause contained in the input pattern V_pe, there is some execution of
Pattern_Generate (Fig. 4) (with an appropriate sequence of random choices) that will generate it.

Proof. For every pattern V_pe and every c-key K contained in V_pe there exists a permutation π such that K
occupies the |K| rightmost bit positions of π(V_pe) (step 1). The algorithm proceeds by discarding one by
one, from left to right, every bit it examines for as long as there is a c-key present within the remaining
subpattern (steps 2-6). It follows that, since K is a c-key and occupies the |K| rightmost positions of π(V_pe),
K is the pattern that will be found^3. Steps 7-9 of the algorithm generate a pattern containing K and
specifying at most two other arbitrarily chosen positions.

□ 

Lemma 4.0.3. The Randomize_NDB algorithm, under any sequence of random choices, produces an NDB
that corresponds to an instance of SAT that is satisfied exactly by DB.

Proof. Let ns_j be any string in U - DB and let i be the length of the smallest prefix V_p of ns_j that is absent
from DB. The algorithm will find this prefix at iteration i (line 3) and create at least one distinct string
with a subpattern p of V_p that is absent from DB (steps 4-8).

3 Note that it is not required for the c-key to be contiguous or to occupy the rightmost bits to be found. It is only convenient
to focus on this case for the proof.
Pattern_Generate(DB, V_pe)

1. Find a random permutation π.
2. for i = 1 to |V_pe| do {
3.     Determine whether π(V_pe) without its ith
       bit is in π(DB)
4.     if not in π(DB) {
5.         π(V_pe) ← π(V_pe) - ith bit
6.         Keep track of the ith bit in a set
           indicator vector (SIV) }}
7. Randomly choose 0 ≤ t ≤ 2
8. if t > |SIV| then
9.     R ← All bits from SIV
   else
10.    R ← t randomly selected bits from SIV
11. Create a pattern V_k using π(V_pe),
    the bits indicated by R and “don’t care”
    symbols in the remaining positions.
12. Return π'(V_k) (π' is the inverse
    permutation of π).


Figure 4: Pattern_Generate produces a string over {0, 1, *} with at most two extraneous bits, matching V_pe
and not matching any string in DB.


If p is not found in DB then p must differ in at least one bit from every pattern in DB, and the complement of p must
match every string in DB in at least one position. Our mapping to SAT creates clauses that correspond to
p (see Fig. 4 and Lemma 4.0.3) and are thus satisfied by every string in DB and unsatisfied by ns_j (for all
ns_j ∈ U - DB).

□ 
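Figures 3 and 4 can be combined into a short Python sketch. It is a reading of the pseudocode under several assumptions of ours, not the authors' implementation: patterns are dictionaries from 0-based positions to bits, visiting the defined positions in random order stands in for the permutation π, "in DB" means that some DB string agrees with the pattern, and the random source is a `random.Random` instance passed in by the caller.

```python
import math
import random
from itertools import product


def present(pattern, db):
    """True if some DB string agrees with `pattern` (a dict position -> bit)."""
    return any(all(s[p] == b for p, b in pattern.items()) for s in db)


def pattern_generate(db, vpe, length, rng):
    """Shrink `vpe` to a minimal absent pattern, then re-specify up to two bits."""
    pattern = dict(vpe)
    order = list(pattern)
    rng.shuffle(order)              # random visiting order plays the role of pi
    siv = []                        # positions discarded so far
    for pos in order:
        trial = {p: b for p, b in pattern.items() if p != pos}
        if not present(trial, db):  # still absent from DB without this bit
            pattern = trial
            siv.append(pos)
    t = rng.randint(0, 2)
    extra = siv if t > len(siv) else rng.sample(siv, t)
    for pos in extra:               # restore at most two of the discarded bits
        pattern[pos] = vpe[pos]
    return "".join(pattern.get(p, "*") for p in range(length))


def randomize_ndb(db, length, rng):
    """Sketch of Randomize_NDB for a set of binary strings of equal length."""
    ndb = set()
    i = math.ceil(math.log2(length))
    w = {"".join(bits) for bits in product("01", repeat=i)}
    while i < length:
        seen = {s[: i + 1] for s in db}
        absent = {p + b for p in w for b in "01"} - seen
        for vp in sorted(absent):
            vpe = dict(enumerate(vp))
            for _ in range(rng.randint(1, length)):
                ndb.add(pattern_generate(db, vpe, length, rng))
        i += 1
        w = {s[:i] for s in db}
    return ndb


db = {"0001", "0100", "1000", "1011"}
print(sorted(randomize_ndb(db, 4, random.Random(0))))
```

Different seeds generally yield different negative databases for the same DB, but each output matches exactly the strings of U - DB, in line with Lemma 4.0.3.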

Lemma 4.0.4. The RNDB algorithm can generate any formula of at most n c-clauses containing solely
the n variables present in window w_n (when w_n is the first window considered) that is satisfied exactly by
DBα, where DBα consists of all the n-length prefixes of the strings in DB.

Proof. Let φ be a formula satisfied exactly by DBα and C_1 ... C_n the c-clauses composing φ. Uα - DBα.
For any c-clause C_q in φ, the complement pattern does not satisfy it. By definition, any string containing the
complement of C_q is in Uα - DBα and every string containing the complement pattern of C_q is considered
by the algorithm, and can generate C_q. Note that each call to Pattern_Generate (Fig. 4) returns only one
clause. However, up to n ≤ l calls are made on the same V_p, so even if all n clauses in φ must come from
the same V_p, there are sufficient calls to account for them.

Further, no clause not in φ need be generated when w_n is considered because every string s in Uα - DBα
that is considered for window w_n must contain (since it does not satisfy φ) the complement pattern of at
least one C_q of φ and thus is capable, with an appropriate sequence of random choices, of generating this C_q
and no additional clause (clauses generated repeatedly appear only once in the set of clauses returned).

Note that if there exist one or more formulas of at most n c-clauses containing solely the first n variables
which are satisfied by exactly DBα, RNDB will add no additional clauses after the initial window w_n is
considered, because at future iterations there will be no strings which do not appear in w_{i+1} that have a
prefix in w_i.

□ 

Theorem 4.0.5. The RNDB algorithm can generate every possible 3-SAT formula such that the number
of clauses is bounded by the number of variables.

Proof. Let φ be any 3-SAT formula of l variables and let DB be the set of assignments that exactly satisfy
φ.

For every database DB there exists another database, DBβ, such that DB contains all the l-length
prefixes of strings in DBβ and DBβ contains every possible string of length 2^l with those prefixes. The
RNDB algorithm on input DBβ will set its initial window to encompass the first l bit positions; by Lemma
4.0.4 the algorithm can generate any formula of at most l c-clauses containing only the first l bit positions
of DBβ. After considering this first window the algorithm will not generate any more clauses, since there
are no additional strings in Uβ - DBβ whose immediate prefix is contained in DBβ. Hence the RNDB
algorithm will output φ by making the appropriate random choices. □

Corollary 4.0.3. The image of RNDB defines an NP-complete restriction of NESR. Similarly, the image
of RNDB defines an NP-hard restriction of ESR and SR.

Proof. By Theorem 4.0.5 RNDB can generate an NDB corresponding to (under the transformation of the
proof of Theorem 3.0.2) any instance of 3-SAT in which the number of clauses is bounded by the number of
variables. The set of all such instances of 3-SAT is known to define an NP-complete problem. □

4.1  Discussion 

Our results establish that, given an NDB as input, it is an NP-hard problem to determine the DB it
represents,  or  even  to  determine  simply  if  NDB  represents  the  empty  DB.  Note  that  this  result  does  not, 
however,  address  directly  the  irreversibility  of  the  overall  privacy  scheme:  given  a  DB  as  input,  we  wish  to 
produce  an  NDB  that  cannot  be  reversed  efficiently  in  time  measured  as  a  function  of  the  size  of  DB.  In 
particular,  the  proof  of  Lemma  4.0.4  identifies  instances  NDB  which  RNDB  can  create  from  DB  which 
may  be  logarithmic  in  the  size  of  DB.  Consequently,  a  reconstruction  algorithm  exponential  in  the  size  of 
the  representation  NDB  could  be  polynomial  in  the  size  of  the  original  DB  which  the  scheme  represents. 

It  remains  an  open  question  as  to  whether  a  randomizing  variant  of  RNDB  can  be  devised  to  achieve 
this  ultimate  irreversibility  goal.  Consider,  for  example,  the  variant  of  RNDB  shown  in  figure  5. 

This algorithm is similar to the one presented in Figure 3; the difference (lines 7-10) is that the input
pattern to Pattern_Generate is augmented with at most three bit positions outside the scope of the current
prefix window. This enables the algorithm to find any c-clause satisfied by DB in any given run.

The algorithm produces NDBs that are polynomially related in size to DB, and it remains to explore
whether the set of instances produced indeed defines an NP-hard restriction of NESR. Note that this result
is possible without implying that NP = co-NP, because it may not be possible to decide in deterministic
polynomial time whether an arbitrary input is an instance of the restricted version of NESR defined by
the image of the algorithm. Further, it is important to point out that the NP-hardness of a problem is a
measure of worst-case difficulty, and practical intractability remains to be ascertained.

Finally, we note that both algorithms presented in this section run in time O(l^2 |DB|^2) by observing that
procedure Pattern_Generate (Fig. 4) runs in time O(l|DB|) and is called a total of O(l|DB|) times.

5  Related  work 

There  are  several  areas  of  research  that  are  potentially  relevant  to  the  ideas  discussed  in  this  paper.  These 
include:  encryption,  privacy-preserving  databases,  privacy-preserving  data-mining,  query  restriction  and 
negative  data. 

An  obvious  starting  point  for  protecting  sensitive  data  is  the  large  body  of  work  on  cryptographic  methods, 
e.g.,  as  described  in  [27].  Some  researchers  have  investigated  how  to  combine  cryptographic  methods  with 
databases  [18,  17,  5],  for  example,  by  encrypting  each  record  with  its  own  key.  Cryptography,  however,  is 
intended  to  conceal  all  information  about  the  encrypted  data,  and  it  is  therefore  not  conducive  to  situations 
in  which  we  want  to  support  some  queries  efficiently  but  not  reveal  the  entire  database. 

Cryptosystems founded on NP-complete problems [16] have been explored, such as the Merkle-Hellman
cryptosystem [23], which is based on the general knapsack problem. These systems rely on a series of tricks


Randomize_c-clause algorithm

Let w_i denote an i-bit prefix and W_i a set of i-length patterns.

1.  i ← ⌈log2(l)⌉
2.  Initialize W_i to the set of every pattern of i bits.
3.  Set W_{i+1} to every pattern not present in DB's w_{i+1} but with prefix in W_i.
4.  for each pattern V_p in W_{i+1} {
5.      Randomly choose 1 ≤ j ≤ l
6.      for k = 1 to j do {
7.          Randomly select at most three bit positions a, b, c s.t. i < a, b, c ≤ l
8.          for every possible bit assignment B_p of the selected positions {
9.              V_pe ← V_p · B_p
10.             V_pg ← Pattern_Generate(π(DB), V_pe)
11.             Insert V_pg in NDB. }}}
12. Increment i by one.
13. Set W_i to every pattern in DB's w_i.
14. Return to step 3 as long as i < l.

Figure 5: The Randomize_c-clause algorithm randomly generates a negative database representing the strings
in U - DB and is capable of producing every possible c-clause.


to conceal the existence of a “trapdoor” that permits retrieving the hidden information. However, almost
all knapsack cryptosystems have been broken [26], and it has been shown [7, 8] that, in general, if breaking a
cryptosystem is NP-hard then NP = co-NP, a point addressed in Section 4.

If a scheme based on an NP-hard result, such as the one proposed here, is to be used in a privacy setting, it
will be indispensable to study under what conditions it indeed produces hard-to-reverse instances. In the
case of negative databases, there is a large body of literature that addresses this issue due to its isomorphism
with the satisfiability (SAT) problem [24, 11].
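The correspondence with SAT can be made concrete with a small sketch. It assumes records over {0, 1, *} with `*` as the don't-care symbol; the function names and the clause encoding are hypothetical choices made for this illustration. Each NDB record becomes one clause stating that a string differs from the record in at least one specified position, so the strings in DB are exactly the satisfying assignments of the resulting formula.

```python
from itertools import product

def matches(record, s):
    """True if string s is matched by an NDB record over {0, 1, *}."""
    return all(r == "*" or r == c for r, c in zip(record, s))

def record_to_clause(record):
    """Map a record to a clause: a list of (position, required_bit) literals.
    A string avoids the record exactly when it differs from it at some
    specified position, so each literal asks for the opposite bit."""
    return [(k, "1" if ch == "0" else "0")
            for k, ch in enumerate(record) if ch != "*"]

def satisfies(assignment, cnf):
    """True if the bit string satisfies every clause of the formula."""
    return all(any(assignment[k] == bit for k, bit in clause) for clause in cnf)

ndb = ["11*", "0*0", "*01"]
cnf = [record_to_clause(r) for r in ndb]

universe = ["".join(bits) for bits in product("01", repeat=3)]
db_from_ndb = [s for s in universe if not any(matches(r, s) for r in ndb)]
db_from_sat = [s for s in universe if satisfies(s, cnf)]
print(db_from_ndb, db_from_sat)  # both are ['011', '100']
```

Under this encoding a record with c specified positions yields a clause with c literals, and a record consisting only of don't-care symbols yields the empty, unsatisfiable clause.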

Of particular relevance, then, are one-way functions [20, 25], which are easy to compute but hard
to reverse, and one-way accumulators [4, 9], which are essentially one-way hash functions with the property
of being commutative. One key distinction between these existing methods and the negative database is
that the output of a one-way function is usually compact and the message it encodes typically has a unique
representation. Representing data negatively, as described here, permits a message to be encoded in several
ways, one of which is chosen randomly (an idea used in probabilistic encryption [21, 6]).

In privacy-preserving data mining, the goal is to protect the confidentiality of individual data while still
supporting certain data-mining operations, for example, the computation of aggregate statistical properties
[3, 2, 1, 12, 14, 29, 28]. In one example of this approach (ref. [3]), relevant statistical distributions are
preserved, but the details of individual records are obscured. Our method is almost the reverse of this
approach, in that we support efficient membership queries but higher-level queries may be expensive.

Our method is also related to query restriction [22, 10, 12, 13, 28], where the query language is designed
to support only the desired classes of queries. Although query restriction controls access to the data by
outside users, it cannot prevent an insider with full privileges from inspecting individual records to retrieve
information.

The term “negative data” sounds similar to our method, but is actually quite different. The deductive
database model (see [19] for an excellent survey of its foundations) supports the negative representation
of data in the intensional database (IDB). The objectives, mechanisms, and consequences
of that model are quite different from those of our scheme. In a deductive database, traditional motivations for “negative
data” include reducing space utilization, speeding query processing, and the specification and enforcement
of integrity constraints.

To  summarize,  the  existence  of  sensitive  data  requires  some  method  for  controlling  access  to  individual 
records.  The  overall  goal  is  that  the  contents  of  a  database  be  available  for  appropriate  analysis  and 
consultation  without  revealing  information  inappropriately.  Satisfying  both  requirements  usually  entails 
some  compromise  such  as  degrading  the  detail  of  the  stored  information,  limiting  the  power  of  queries,  or 
database  encryption. 

6  Discussion  and  Conclusions 

In  this  paper  we  have  established  the  feasibility  of  a  new  approach  to  representing  information.  Specifically, 
we  have  shown  that  negative  representations  are  computationally  feasible  and  that  they  can  be  difficult  to 
reverse.  However,  there  are  many  important  questions  and  issues  remaining. 

Which  classes  of  queries  can  be  computed  efficiently  and  which  cannot?  Our  initial  results  address 
two  extremes — the  case  of  testing  membership  for  a  specific,  single  record  and  the  case  of  reconstructing 
the  entire  positive  database.  We  would  like  to  understand  the  computational  complexity  at  points  across 
the  spectrum  between  these  two  extremes,  as  well  as  understanding  what  computational  properties  are 
desirable  in  a  privacy-protecting  context.  A  related  question  involves  the  costs  of  database  updates  under 
our  representation.  How  expensive  is  it  to  insert  or  delete  entries  from  the  negative  database  under  the 
different  representations? 

Are there other, better representations of NDB? Once we understand more completely the computational
properties of our current representations, we may be able to devise other representations whose properties
are more appropriate for some applications.

An important feature of NDB is its distributability, in which the NDB is partitioned into disjoint sets,
or fragments. In a distributed NDB, positive membership queries can be processed with no communication
among the database fragments. We would like to study this property in more detail.
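A minimal sketch of such a query, assuming records over {0, 1, *} and a querier that simply combines the independent answers (the function names and the combining step are our assumptions): a string belongs to DB exactly when no fragment holds a record matching it, and each fragment can decide its part using only its own records.

```python
def matches(record, s):
    """True if string s is matched by an NDB record over {0, 1, *}."""
    return all(r == "*" or r == c for r, c in zip(record, s))

def fragment_rejects(fragment, s):
    """Evaluated locally at one site: True if some record held there matches s."""
    return any(matches(r, s) for r in fragment)

def in_db(fragments, s):
    """s is in DB iff no fragment rejects it; fragments never consult each other."""
    return not any(fragment_rejects(f, s) for f in fragments)

# The NDB {11*, 0*0, *01} split into two disjoint fragments.
fragments = [["11*"], ["0*0", "*01"]]
print(in_db(fragments, "011"))  # True: no fragment holds a matching record
print(in_db(fragments, "110"))  # False: the first fragment matches it
```

A single rejecting fragment suffices to answer "not in DB", whereas confirming membership requires a negative answer from every fragment.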

Finally,  we  are  interested  in  inexact  representations.  The  NDB  representation  is  closely  related  to  partial 
match  detection  [15]  which  has  many  applications  in  anomaly  detection.  We  are  interested  in  studying  how 
those  methods  might  be  combined  with  NDB  either  for  designing  an  adaptive  query  mechanism  or  for 
approximate  databases. 

In  conclusion,  although  we  have  shown  that  negative  representations  of  data  are  computationally  feasible 
and  in  some  cases  difficult  to  reverse,  there  are  many  possible  avenues  for  future  work.  By  tailoring  a 
negative  representation  to  particular  requirements,  we  are  optimistic  that  we  can  address  at  least  some  of 
the  problems  presented  by  large  collections  of  sensitive  data. 